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Abstract 

Normalization procedures are widely used in high-throughput genomic data analyses to remove various technological noise 
and variations. They are known to have profound impact to the subsequent gene differential expression analysis. Although 
there has been some research in evaluating different normalization procedures, few attempts have been made to 
systematically evaluate the gene detection performances of normalization procedures from the bias-variance trade-off point 
of view, especially with strong gene differentiation effects and large sample size. In this paper, we conduct a thorough study 
to evaluate the effects of normalization procedures combined with several commonly used statistical tests and MTPs under 
different configurations of effect size and sample size. We conduct theoretical evaluation based on a random effect model, 
as well as simulation and biological data analyses to verify the results. Based on our findings, we provide some practical 
guidance for selecting a suitable normalization procedure under different scenarios. 
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Introduction 

High-throughput data such as microarray expression data and 
RNA-seq data have become an indispensable tool for medical 
research nowadays. Typically, a data pre-processing step called 
normalization is conducted prior to the subsequent statistical 
analyses in order to remove various systematic noise. Pertinent 
statistical significance tests are applied to these normalized gene 
expression levels. Parametric tests such as Student's /-test and its 
improvements such as the one used in SAM (Significance Analysis 
of Microarray, [1]) are widely used. Non-parametric tests such as 
Wilcoxon rank-sum test serve as distribution free alternatives. The 
resulting />-values are adjusted by a multiple testing procedure 
(MTP) in order to control certain quantity of per-family Type I 
error, such as familywise error rate (FWER) [2—5] and false 
discovery rate (FDR) [6]. Differentially expressed genes are 
identified based on a pre-specified threshold of adjusted ^-values. 
More detailed introduction of statistical methods for detecting 
differentially expressed genes can be found in [7-10]. 

There are different sources of technological noise in high- 
throughput genomic data [11,12]. Over the past decade or so, 
many normalization procedures have been proposed to remove 
such noise, and they can be loosely categorized into within-array 
normalizations and multiple-array normalizations. Within-array 
normalizations remove noise by "borrowing information" from 
gene expressions within a single array. A simple example is the 



global normalization, which applies a constant adjustment to force 
the distribution of gene expressions to have a common mean or 
median within each array [13,14]. Another example is the rank 
normalization which replaces each observation by its fractional 
rank (the rank divided by the total number of genes) within array 
[14,15]. This normalization procedure achieves more robustness 
to non-additive noise compared with global normalization at the 
expense of losing some parametric information of expressions. 

Multiple-array normalizations adjust the scale across arrays to 
avoid different arrays having undue weight. A typical example is 
the quantile normalization. Motivated by quantile-quantile plot, it 
makes the empirical distribution of gene expressions pooled from 
each array to be the same [16]. In particular, quantile normalized 
arrays must have the same sample mean/median. In this sense, it 
is stronger than global normalization. On the other hand, it does not 
change the rankings of genes and retains more parametric 
information than the rank normalization. So it can be viewed as 
a compromise between the global and rank normalization 
procedures. 

A data-driven variable transformation called the <5-sequence 
method [17,18] can serve as an alternative to the aforementioned 
normalization procedures. In this procedure, essentially each gene 
is normalized by another one witii similar variance which acts as 
the "reference gene". This normalization is local because only one 
gene is needed to normalize a given gene. Interested readers are 
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referred to [19,20] for background and more detailed reviews of 
normalization procedures. 

Surrogate variable analysis (SVA) [21] was designed to 
overcome the problems caused by heterogeneity in expression 
data. The rationale of this method is to remove the detrimental 
effects due to unmodeled variables such as demographic, 
environmental, and technical factors. 

Although SVA was not originally designed as a normalization 
procedure, it can serve this role as long as the array-specific 
technical noise is considered as one of the unmodeled factors. To 
some extend, SVA can be viewed as an extension of the global 
normalization, which normalizes the the observed expression 
levels by removing just one factor, per-array mean expressions. 

Normalization procedures are designed to remove technological 
noise and improve the detection of differentially expression genes. 
However, the testing power improvement from normalization 
procedures may also come with a price. The performance of a 
particular normalization procedure in terms of gene selection 
power and type I error control is heavily influenced by the sample 
size and true effect size of gene differentiation. Some efforts have 
been made to evaluate different normalization procedures [16,22- 
25], but few attempts have been made to systematically evaluate 
the impact of sample size and effect size on the gene detection 
performances of normalization procedures. 

In this paper, we conduct a thorough study to evaluate the 
performance of gene selection strategies which consist of four 
normalization procedures (plus the case without normalization) 
combined with three statistical tests and two MTPs, with various 
combinations of sample sizes and effect sizes. It is well-known that 
variance reduction always comes with the price of bias. We use this 
bias-variance trade-off principle to study the theoretical statistical 
properties of normalization procedures. Simulation and biological 
data analyses are conducted to support the theoretical results in 
the biological application. Based on our study, we make several 
concluding points in the Section Discussion. Whenever possible, 
we provide theoretical explanations to support these conclusions 
based on a random effect model [15]. We hope these findings can 
provide biomedical researchers with some practical guidance for 
selecting the best gene selection strategy. 

The following supporting materials are presented in File SI: 1 . a 
brief introduction of the A-test; 2. theoretical derivations for the 
main results in the Methods Section; 3. additional simulation 
results; 4. additional results of biological data analyses. 

Methods 

Model for Gene Expression Data 

In this paper, we assume all expression levels are log- 
tranformed. Additive (non-additive) noise are thus multiplicative 
(non-multiplicative) noise in terms of the original expression levels. 
For convenience, the words "gene" and "gene expression" are 
used interchangeably to refer to these log-transformed random 
variables. 

Let c = A,B be two different phenotypic groups, m be the total 
number of genes, and n be the number of arrays sampled from 
each phenotypic group. Without loss of generality, phenotypic 
group A is set to represent the phenotype of interest (usually the 
disease or the treatment group) and group B the normal 
phenotype. So up (down) regulation of a gene refers to its over 
(under) expression in group A. We denote by y c t , the observed 
expression level of the rth gene recorded on the y'fh array sampled 
from the fth phenotypic group. Conceptually speaking, the 
variation of the observed expression, y 0 .,, can be decomposed into 
two sources: a) the biological variation of the ith gene pertain to 



the y'th sample, denoted by Xy (unobservable); b) the array-specific 
noise due to imperfect measurement technology, denoted by a? 
(unobservable). 

We denote E(^) =\x c i and var(;yp = (X? . For a given gene, its 

true effect size, or the expected (geometric) mean difference of 
expressions between two phenotypes, is written as e, : = fif — /-tf. 
We also assume that the (unobservable) true correlation coefficient 
between two noise-free expression levels is p l -j= corr(x^,Xy.) and 
the correlation coefficient between observed gene expressions i 
and j is . = corr{y c l ,y C j.). For the biological data (see Section 
"Biological Data" and [26]) without any normalization proce- 
dures, plj is very high. The sample mean of Pearson correlation 
coefficients of all gene pairs computed from all three sets of 
biological data (see Section "Analyzing Biological Data" for more 
details) are close to 0.9. A histogram of such correlation 
coefficients is provided as Figure SI in File SI. 

This high correlation may come from two sources [27]: a) the 
biological correlation of the rth and y'th gene p?; b) the 
technological noise a c . 

We divide genes into three sets: 

• Go, the set of non-differentially expressed genes (abbreviated as 
NDEGs). For all ieG 0 , e, : =nf-fif = 0; 

• Gj + , the set of up-regulated genes. For all ieGf , e,->0; 

• Gf , the set of down-regulated genes. For all feGf , e, <0. 

The set of differentially expressed genes (abbreviated as DEGs) 
is the union of both up-regulated and down-regulated genes, 
which is denoted by Gi = Gj + U Gf . We write the size of these 
gene sets by wo = |Go|, wi l + =]G l + |, mf =|Gf |, and m\=\G\\. 
Apparendy m\ =mf +Wj~ and mo + m\ =m. 

Normalization Procedures 

The following normalization procedures are studied in this 
paper: 

1. Global normalization (GLOBAL): Subtract each element of 
the data matrix by the mean over all gene expression signals on the 
array to which this element belongs [13,14]. The normalized gene 
expressions are: 

fi=y%-f v (i) 

where y?j = 1 Yl= i V% an d 7 = 1,2,... ,n, 

2. Rank normalization (RANK): Replace every gene expression 
in one array by its rank in the (ascendingly) ordered array divided 
by the total number of genes. Denote r| as the rank of y^ in the 
array to which it belongs, the normalized gene expressions are. 



This method was proposed by [15] and discussed further in [14]. 

3. Quantile normalization (QUANT): First, a reference array of 
empirical quantiles, denoted as q = {q\,qi-, ■ ■ ■ ,q m ), is computed by 
taking the average across all ordered arrays. Let 
y c ^y<y^ 2 y< ■ ■ ■ <y^ m y denote the ordered gene expression 
observations in the y'th array (/' = 1 ,2, . . . ,n) of the c th (c = A,B) 
group, the rth (r= 1,2, . . . ,m) element of this reference array is. 
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(3) 



Next, the original expressions are replaced by the entries of the 
reference array with the same rank. The normalized gene 
expressions are. 



= <5V C . : 

!/ 



1 

2n 



(4) 



We refer the reader to [16] for more details. 

4. ^-sequence (DELTA): Sort all genes by their sample 
variances and denote the sorted array by (y c 0l j,y c 02 j, • • • >XL/)> 
where a (y c 0l j)<a Hy c ov )< ■■■ <a (y c 0mJ )- This step is to ensure 
similar variances between two consecutive genes. Starting with the 
first gene in the given gene ordering, record the differences 
between two successive genes: 



~ y °2k-\i y "2kJ' 



(5) 



where k= 1,2, . . . , ™. Here m is assumed to be an even number for 
simplicity. More technical details can be found in [17,18]. One 
interpretation of this step is that each gene is normalized by a 
"reference gene" with similar variance. This normalization is local 
because it only involves one gene to normalize another. Next, 
select candidate gene pairs by applying an appropriate hypothesis 
test and an MTP to d* k c - . In order to make the <5-sequence method 
directly comparable to other methods, the following ad hoc method 
was proposed [17] to "break the pairs". Starting with the second 
gene in the given gene ordering, record the differences between 
two successive genes (the last gene is paired with the first one). 
Select another set of candidate gene pairs based on these new 
differences. Report the intersection of the two gene sets (unpaired 
genes) as a final list of differentially expressed genes. More details 
can be found in [17,18]. 

5. Surrogate variable analyses (SVA): in this approach, the 
observed gene expression is modeled as 



= H C i+ Y,Viiglj + e*j, 
1=1 



(6) 



where gy represents an arbitrary function of the /th unmodeled 
factor on the y'fh array, 5Tjf_ l YliSlj represents dependent variation 
across genes due to those unmodeled factors, and e*j represents the 
"true" independent noise that are specific to the /th gene and y'th 
array. The designing goal of SVA is to estimate and remove 
5Tjf= l YliSlj from the observed expression levels, so that the 
subsequent statistical analysis will not be influenced by those 
unmodeled factors. 

Since 5Tjf_ l YliSlj ' s typically not direcdy observable, SVA uses 
the following alternative model to remove the effects of unmodeled 
factors 



(7) 



Gene differential analysis based on SVA is typically done in this 
way. In the first step, expression levels are fitted by two linear 
models. The first one is a null model which only includes the 
surrogate variables; the second one is a full model which includes 
both surrogate variables and group labels. Two choices are 
available: a) ordinary linear regression based on least squares; b) a 
modified linear regression method implemented in LIMMA. An 
i**-test can then be used to test whether the group labels are 
significant. More technical details can be found in [21]. 

Due to the nature of the SVA method, a combination of SVA 
with linear regression is comparable to the two-sample /-test 
together with other normalization methods; SVA/LIMMA is 
comparable to LIMMA with other normalization procedures. 
Since the SVA method is strongly tied with linear regression 
methods, we chose not to combine Wilcoxon rank-sum test and At- 
test with it in this study. 

6. As a comparison, we also study the properties of gene 
selection procedures without normalization (NONE). 

The Bias-variance Trade-off of Normalization Methods 

The choice of normalization procedure has a profound impact 
on the subsequent analyses. It can alter both the testing power and 
type I error of the gene selection procedure. In this section, we 
evaluate the impact of various normalization procedures on testing 
power and type I error control from the view point of bias- 
variance trade-off. 

To simplify theoretical derivation, we assume that the mean 
expression levels in the normal phenotype (group B) are zeros 
(flf = 0). This assumption implies that flf = fif + ej = ej. This 
simplification is reasonable because all three hypothesis testing 
procedures studied in this paper (t-test, Wilcoxon rank-sum test, 
and Attest) are invariant under shift transformation, so the mean 
difference between two groups, e,, is much more important than 
the normal level of gene expressions. For derivation simplicity, we 
also assume that the effect sizes of all up(down)-regulated genes are 
the same. In summary, we have 



<*)■ 



e+ c = A, feGj + , 

e~ c = A, /eGf , 

0 c = A, ieGo, 

0 c = B, 



(8) 



where e + >0 and e~ <0 are the effect sizes of up- and down- 
regulated genes. 

Below we list the bias increase and variance reduction induced 
by normalization procedures. Theoretical derivations of these 
results can be found in Section 2 of File SI. 

Global normalization. By computing the expectation of 
global normalized expressions for both phenotypes, we find that 
the global normalization introduces a small bias 

-EyCj = 

group difference for global normalized data are 



m ' c e . More specifically, the expected mean 



m 7~ e t + ™ e 



feGf, 
ieG 0 . 



(9) 



where htj, k=l, . . . ,K, K<L, are orthogonal vectors that span 
the same linear subspace as gys do. 



This is because genes are normalized with respect to the array 
means, which is computed from both DEGs and NDEGs. 
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GLOBAL can reduce the variance of gene expression. 
Specially, if cr? ss a 2 and p| ~ p for all genes, we have 



and 



var(j«); 



f 7 2 (l-p)< ( J 2 , 



which shows that the global normalization can reduce the variance 
of expressions, especially when p is close to 1 . 

This variance reduction is confirmed in both simulations and 
biological data analyses. In SIMU3 (see Section Simulation 
Studies for more details) with sample size 10 and effect size 1.8, the 
average sample variance of expression levels is 0.1286 before 
global normalization and 0.01367 after. For the biological data 
HYPERDIP with sample size 10, the average sample variances 
before and after global normalization are 0.1514 and 0.01395, 
respectively. 



When 



the proportion of DEGs among all genes, is 



sufficiently small, the bias introduced by the global normalization 
is negligible because 



+m, e mt +m, , , 
i |<— ! l -msLx(e + ,-e ). 



Alternatively, this bias can be negligible if a) the numbers of up 
and down regulated genes are similar (w, ~WJ~); b) up and down 
regulations induce similar differential expression (e + « — e~). We 
call this case the balanced differential expression structure henceforth. In 
either case, we predict that the testing power will be improved 
because the induced bias is small and the variances of global 
normalized genes are reduced. 

The bias effect can be substantial if the differential expression 
structure is highly unbalanced. For example if most DEGs are up- 
regulated (»i 1 + »mj~, mj + sifni), the mean expression difference 

m, + e + +mfe~ m x e + 

tor an up-regulated gene is reduced by — « 

m m 

after the global normalization. Based on these considerations, we 
predict that the testing power in the balanced structure case will be 
better than that in the unbalanced structure case. 

mt e, + + m7e~ 

From Equation 9, a false effect size — is 

m 

introduced by GLOBAL for the NDEGs. This false effect size can 
become substantial when the effect size or the sample size 
increases. As a result, we predict that many NDEGs will be falsely 
declared as DEGs. We also predict that the resulted false positives 
are more pronounced in unbalanced structure than in balanced 
structure. 

Quantile normalization. Like the global normalization, the 
quantile normalization can also increase bias and reduce variance 
of the gene expressions simultaneously. 

We conduct thorough investigation of this bias and discover that 
it comes from two different sources, namely the rank skewing effect 
and the averaging effect (see Section 2, File SI). Based on our 
derivations, the expected group differences for quantile normal- 
ized expressions are 



-yf) 



2m Q 



„+e+- 



feGf, (10) 
ieG 0 . 



where «5i = m TE"L mf+mo + iE(v^ > .) 

^ = ^E';iiE(yf rV ). 

The effect size of DEGs decreases if e + and |e~| are 
substantially larger than both <5i and 82. Such effect size decrease 
is confirmed in our simulation. Take SIMU3 (see Section 
Simulation Studies for more details) with sample size 10 and true 
effect size 1 .8 as an example, the average sample mean difference 
ei=y1 A -yf for up(down)-regulated DEGs is 1.803 (-1.7966) 
before quantile normalization and 0.9901 (-1.0383) after. The 
average sample mean differences for NDEGs before and after 
quantile normalization are 0.0034 and —0.0198, respectively. 

The variance of y\j is complex and deserves further theoretical 
investigation. Empirical evidence shows that the quantile normal- 
ization has very good variance reduction capability. For example, 
the average sample variances of SIMU3 with sample size 1 0 and 
effect size 1.8 before and after quantile normalization are 0.1286 
and 0.06378, respectively. Similarly, the average sample variances 
of biological data (HYPERDIP, sample size 10) before and after 
quantile normalization are 0.1514 and 0.0086, respectively. The 
variance reduction of GLOBAL is better than that of QUANT in 
SIMU3 since GLOBAL is designed for additive noise structure. 
On the other hand, the variance reduction of QUANT is better 
than that of GLOBAL in real biological data since QUANT is 
robust to non-additive noise structure. 

Like the global normalization, the benefit of variance reduction 
induced by the quantile normalization outweighs the adverse effect 
of bias introduced by this procedure when e + , — e~, and n are 
small. This can be seen from simulation results in Figures 1, 2, and 
3(a, c). Take Figure 1 (a, c) as an example, the testing power with 
quantile normalization increases as the effect size increases given 
that the sample size is small. However, gene selection strategies 
based on the quantile normalization is outperformed by those 
without normalization when effect size reaches a certain point 
(around effect size 1.4). 

According to Equation (6) in File SI and Equation (10), the 

effect size of DEGs is reduced for both balanced and unbalanced 

differential structures. For DEGs, their effect size after quantile 

e + e~ 
normalization has three parts: a) — (up-regulated) or — (down- 



regulated); b) 



mt e + + m, e 



2 

8\ 82 
7 -; c) — (up-regulated) or — (down- 

regulated). The first part is clearly independent of the structure of 
differential expression. The third part is small as compared to the 
other two parts according to Equation (6) in File S 1 . The second 
part is negligible for balanced differential structure but can be 
substantial for unbalanced differential structure. For example, if 



mj +m l e 
2m 



2m 



Consequently, we pre- 



dict that the testing power for data with balanced differential 
structure is better than those with unbalanced differential 
structure. 

For NDEGs, the quantile normalization introduces a bias 



m l + 8\+m 1 82 m l + e + +m ] e 



based on 



E(tf)-E(tf)*- 

Equation (10). This bias is more pronounced in an unbalanced 
structure. With large effect size or sample size, this bias can lead to 
nontrivial false positives. 

Rank normalization. Compared with the quantile normal- 
ization, the rank normalization goes even further in the 
nonparametric direction. Based on derivations in Section 2 of 
File SI, the expected group expression differences for rank 
normalized expressions have the following approximation 
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Figure 1. Number of true (a,c) and false (b,d) positives as functions of effect size (SIMU1). Total number of genes is m=1000. Total 
number of truly differentially expressed genes is +mf = 100, where and wf are the numbers of up- and down-regulated genes, respectively. 
The sample size is n=lO. f-test and Bonferroni procedure are applied. Adjusted y;-value threshold: 0.05. 
doi:1 0.1 371 /journal.pone.0099380.g001 



"'1 

2m 



_ \ 

2 



ieG+, 



ieG 



(11) 



2m 



ieG 0 . 

introduces a bias 
. This small bias is nonzero in 



From Equation (11), RANK 

E(^)-E(^- W ' ++mr 
y} 'i ' yy 'J ' 2m 

an unbalanced structure, which can lead to nontrivial false 

positives when the sample size is large. 

In a highly unbalanced gene differential expression structure, 

for example W[ the expected difference for up-regulated 

mo 

genes is approximately - — which is smaller than the expected 



difference in the balanced structure — —). Therefore we 

v 2m ' 

predict that the testing power in a balanced structure is higher 
than that in an unbalanced structure. 

The variance reduction effect of RANK comes from a very 
different mechanism as compared to the other normalization 
procedures. It removes the variance of noise by only preserving the 
ordering of observations. 

cj-sequence method. The bias increasing effects of DELTA 
comes from the imperfect "pair-breaking" method. Due to the 
nature of DELTA, a pairing breaking method must be used to 
produce list of individual DEGs. Specifically, a gene is labeled as 
DEG only if it belongs to two significant gene pairs. On average, 
the probability of a given NDEG to be paired with a DEG twice is 

2 

approximately — j. When this relatively rare event happens, an 
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Figure 2. Number of true (a,c) and false (b,d) positives as functions of sample size (SIMU2). Total number of genes is m= 1000. Total 
number of truly differentially expressed genes is +mf = 100, where m, + and mf are the numbers of up- and down-regulated genes, respectively. 
The effect size is e = 0.2. t-test and Bonferroni procedure are applied. Adjusted /;-value threshold: 0.05. 
doi:1 0.1 371 /journal.pone.0099380.g002 



artificial bias will very likely cause it to be mis-classified as DEG 
[17,18]. This effect is confirmed in the simulation studies. 

The variance reduction effects of DELTA come from the gene 
pairing and subtraction. This variance reduction is confirmed in 
both simulation studies and real data analyses. As an example, the 
average sample variances of data SIMU3 with sample size 1 0 and 
true effect size 1.8 before and after c5-sequence procedure are 
0.1224 and 0.0232, respectively. Similarly, the average sample 
variances of biological data (HYPERDIP, sample size 10) before 
and after c5-sequence procedure are 0.1514 and 0.0237, respec- 
tively. Figures 1 and 2 (a, c) show that the (5-sequence method 
improves the testing power compared with the non-normalized 
case when the sample size and effect size are both small, but the 
improvement is not as good as other normalization procedures. In 
a balanced structure, the expected expression difference can be as 
large as 2e compared to the maximum expected difference e in an 



unbalanced structure. Thus the testing power in a balanced 
structure is better than that in an unbalanced structure, ceteris 
paribus. When the sample size or effect size becomes large, the 
testing power of (5-sequence method can only reach approximately 

m, + +mr 

80%, which is determined by , the proportion of DEGs 

m 

among all genes. More details can be found in [17,18]. 

Unlike other normalization procedures (GLOBAL, QUANT, 
and RANK), we predict that strategies based on DELTA produce 

2 

m\ 

a stable number of false positives approximately equal to mo — 

m l 

which is independent of either the sample size or the effect size. This 
property of DELTA can be attractive when the effect size or the 
sample size is large. 

Surrogate variable analysis method. Since the SVA 
method is based on a complex linear model (Equation (7)), direct 
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Figure 3. Number of true (a,c) and false (b,d) positives as functions of sample size (SIMU3). Total number of genes is m= 1000. Total 
number of truly differentially expressed genes is +mf = 100, where w+ and mf are the numbers of up- and down-regulated genes, respectively. 
The effect size is e= 1.8. f-test and Bonferroni procedure are applied. Adjusted /;-value threshold: 0.05. 
doi:1 0.1 371 /journal.pone.0099380.g003 



derivation of mathematical expectation and variance of normal- 
ized expressions are difficult and deserve further investigation. 
Empirical evidences show that when combined with differential 
expression analysis, SVA increases statistical power while intro- 
duce more false positives. This empirical finding is consistent with 
the bias-variance trade-off effect of other normalization proce- 
dures. 

Hypothesis Testing Methods and Multiple Testing 
Procedures 

We apply the following tests to the normalized data to compute 
unadjusted ^-values (p,;, \<i<m): 

1 . /-test (t). 



2. A moderated f-test implemented in R/BioConductor package 
LIMMA [28] (LIMMA). 

3. Wilcoxon rank-sum test (Wilcox). 

4. A-test, a permutation test based on A-statistics with Euclidean 
kernel (Nstat). 

The third test is a multivariate nonparametric test which has 
been successfully used to select differentially expressed genes and 
gene combinations [14,29-31], differentially associated genes 
[32,33], and synergistic modulators [34]. A brief introduction of 
the A-test can be found in Section 1 of File S 1 . 

In order to control Type I errors, a suitable multiple testing 
procedure (MTP) must be applied to p, to compute adjusted p- 
values. The following two widely used MTPs are employed in this 
paper: 
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1 . Bonferroni procedure (BONF): The adjusted ^-values are 
pi= min (mpj,\). This procedure controls the familywise error 
rate (FWER). 

2. Benjamini-Hochberg procedure (BH): Let p ri <p ri < ■ ■ ■ <Pr m 
be the ordered raw p-values. The adjusted /(-values are 

p r . = minjc-i .. im -|min^— p ric , 1^ J-. This procedure controls 

FDR, the false discovery rate [6] . 

We present our results based on /-test and Bonferroni procedure 
in the main text. The results of other hypothesis testings and MTPs 
are similar and provided in Tables S1-S9 of File SI. 

Biological Data 

The biological dataset used in this study is the childhood 
leukemia dataset from the St. Jude Children's Research Hospital 
database [26]. We select three groups of data: 88 patients (arrays) 
with hyperdiploid acute lymphoblastic leukemia (HYPERDIP). 
79 patients (arrays) with a special translocation type of acute 
lymphoblastic leukemia(TEL) and 45 patients (arrays) with a T 
lineage leukemia (TALL). Since the original probe set definitions 
in Affymetrix GeneChip data are known to be inaccurate [35], we 
update them by using a custom CDF file to produce values of gene 
expressions. The CDF file was downloaded from http:// 
brainarray.mbni.med.umich.edu. Each array is represented by 
an array reporting the logarithm (base 2) of expression level on the 
set of 9005 genes. 

Results 

Simulation Studies 

To match the statistical properties of real gene expression more 
closely and mimic other noise sources such as non-additive noise, 
we apply resampling method to the biological data to construct the 
main simulated data, denoted by SIMU-BIO, as follows. We 
apply /-test to HYPERDIP and TEL (79 slides chosen from each 
set) without any normalization procedure or multiple testing 
adjustment. Under the significance level 0.05, 734 genes are 
declared to be DEGs with an unbalanced differential expression 
structure (677 up-regulated and 57 down-regulated). We record 
the mean difference across HYPERDIP and TEL for each DEC 
as its effectwww.ixinyiwu.com size (e,-). Then, we combine 
HYPERDIP and TEL data and randomly permute the slides. 
After that, we randomly choose In slides and divide them into two 
groups A and B of n slides each, mimicking two biological 
conditions without differentially expressed genes. Here the sample 
size n takes value in {10,20,40,60,79}. Finally, we add the 
recorded effect sizes to 734 genes (identified earlier) in group A. 
These 734 genes are defined as the DEGs in this simulation. 
Similarly, we test phenotypic differences between TALL and TEL 
(45 slides chosen from each set) and discover 546 DEGs with a 
balanced differential expression structure (259 up-regulated and 
287 down-regulated). We then apply the above resampling 
procedure to create simulated data with sample size n takes value 
in {10,20,30,40,45}. 

In addition, we conduct several simulation studies based on a 
widely adopted random effect model used in [15,36-38]: 

>>| !=l,2,...,m;7 = l,2,...,n, c=A,B. (12) 

In this model, a* represents variation that is the same for every 
gene and specific to the /'th array. While it is known that log 
transformation stabilizes variance for microarray data, more 



advanced variance stabilization transformation techniques 
[39,40] can achieve better uniformity of gene-specific variation. 

These simulated data can help us to gain better insight into the 
performance of different normalization procedures. First, we 
simulate several sets of data with additive noise based on a Each 
set of data has two groups of n arrays representing gene 
expressions under two phenotypic groups (group A and B). Each 
array has m = 1000 genes. For both groups, all genes are normally 
distributed with standard deviation c = 0.35 which is estimated 
from the biological data. The number of NDEGs and DEGs are 
set to be /Ko = 900 and mi = 100, respectively. The correlation 
coefficient between every two distinct genes is set to be p = 0.9, 
which is estimated from the biological data. For simplicity, we use 
the same effect size for up and down regulation (e + = — e~ =e). 

We generate three sets of simulated data with the following 
configurations. 

• SIMU1: The expectations of DEGs in group A (xfj, 
i= 1,2, . . . ,m\ +wf , j= 1,2, . . . ,n) are set to be a constant 
e for over-expressed genes (i=\, . . . ,m^~) and —e for under- 
expressed genes (i = mf + 1, . . . ,100). Here the effect size e 
takes value in {0.2,0.6,1.0,1.4,1.8}. (mf ,mf) is set to be 
either (60,40) (balanced differential expression structure) or 
(90,10) (unbalanced differential expression structure). The 
lower and upper bounds 0.2 and 1.8 are both estimated from 
the biological data, so are the proportions of over and under- 
expressed genes. For all genes in group B and NDEGs in 
group A, their expectations are set to be 0. We use SIMU1 to 
study the impact of different effect sizes on gene normalization 
procedures when the sample size is fixed and relatively small, e 
is the tuning parameter of these data sets. 

• SIMU2: The expectations of DEGs in group A are set to be 
e + =e = 0.2 and e~ = — e= — 0.2 for over and under- 
expressed genes, respectively, (m^ ,m^) is set to be (60,40) 
and (90,10). For all genes in group B and NDEGs in group A, 
their expectations are set to be 0. The sample size n takes value 
in {10,40,70,100,400}. We use SIMU2 to study the impact of 
different sample sizes on gene normalization procedures when 
the effect size is fixed and relatively small, n is the tuning 
parameter of these data sets. 

• SIMU3: These datasets have the same configuration as 
SIMU2 except that the expectations of DEGs in group A are 
set to be 1.8 and —1.8 for over and under-expressed genes, 
respectively. We use SIMU3 to study the impact of different 
sample sizes on gene normalization procedures when the effect 
size is fixed and relatively large, n is the tuning parameter of 
these data sets. 

We randomly generate 20 sets of data per tuning parameter for 
SIMU-BIO, SIMU1, SIMU2, and SIMU3. We apply normal- 
ization procedures first and then conduct hypothesis tests to obtain 
raw /(-values. After that, we apply multiple testing procedures to 
get adjusted /(-values. We declare a gene to be differentially 
expressed if its adjusted /(-value is less than a prespecified 
significance level 0.05. The estimated average true/false positives 
of each normalization procedure with /-test and BONF MTP are 
presented in Figures 4, 1,2, and 3. To better understand the trade- 
off between true and false positives, receiver operating character- 
istic (ROC) curves of gene selection strategies based on different 
normalization methods are presented in Figure 5. More thorough 
results are presented as Tables S1-S8 in File SI. 
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Figure 4. Number of true (a,c) and false (b,d) positives as functions of sample size (SIMU-BIO). Total number of genes is m = 9005. Total 



numbers of truly differentially expressed genes are m. 



= 546 for balanced structure and m. 



= 734 for unbalanced structure, where ml 



and m x are the numbers of up- and down-regulated genes, respectively, f-test and Bonferroni procedure are applied. Adjusted />-value threshold: 
0.05. 

doi:1 0.1 371 /journal.pone.0099380.g004 



Analyzing Biological Data 

We apply the aforementioned gene selection strategies to detect 
differentially expressed genes across three different microarray 
datasets (HYPERDIP, TEL and TALL). The numbers of 
positives with ?-test and BONF are presented in Figure 6 and 
Table S9 in File SI. 

Most results agree with what we observe in the simulation 
studies. The results are conspicuous in that the numbers of 
detected DEGs become very large when n is large (n = 45 or 
n = 19). Such a large number of positives may indicate the 
associated strategies failed to control the familywise error rate at its 
nominal level (a = 0.05). 

As observed in the simulation studies, gene selection strategies 
with normalization procedures detect more DEGs than those 
without normalization. For TALL vs. TEL, the strategies with 



QUANT and RANK detect more DEGs than those with 
GLOBAL. This observation suggests that the technical noise 
may not be purely additive and is consistent with what we observe 
in SIMU-BIO. Among four normalization procedures, SVA 
produces the largest number of postives and DELTA is the most 
conservative one. Based on our simulation results, we think it is 
reasonable to believe that the <5-sequence method has relatively 
better control of type I errors. 

Discussion 

It is well known that many undesirable systematic variations are 
observed in high-throughput genomic data. There are many 
choices of normalization procedures to remove systematic noise. 
The properties of these normalization procedures are closely 
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Figure 5. Comparing the performance of normalization procedures by receiver operating characteristic curves. Data used: SIM-BIO 
with n = 45 for (a) and n = 79 for (b). Total number of genes is m = 9005. Total numbers of truly differentially expressed genes are +mf =546 for 
balanced structure and wj 1 " +/nj~ =734 for unbalanced structure, where m, + and mf are the numbers of up- and down-regulated genes, 
respectively, f-test and Bonferroni procedure are applied. 
doi:1 0.1 371 /journal.pone.0099380.g005 



related to the structure of differential gene expression and sample 
size. 

In this study, we find that all four normalization procedures can 
reduce the variances and covariances of gene expressions so that 
the statistical power of the subsequent gene selection procedures 
may be improved. However, they also introduce certain biases 
which may cause more Type I errors and/ or reduce testing power 
in certain situations. This bias-variance trade-off is common in 
many different branches of statistics. We found that gene selection 
strategies based on GLOBAL seem to have the best testing power 
for data generated from a Gaussian model. However, when the 
effect size is large, they produce far more false positives than which 
are permitted by the nominal significance level of Type I errors, 
especially when the gene differential expression structure is 
unbalanced (m^ = 90, mj~ = 10). Gene selection strategies based 
on QUANT and RANK still have good control of Type I error 
while retaining reasonable testing power. In addition, sometimes 
they help detect more DEGs in SIMU-BIO, which suggests that 
these two normalizations work better than GLOBAL when the 
technical noise is not entirely additive (in the log-scale). Overall, 



gene selection strategies based on NONE (without normalization) 
have very good control of type I errors, but their statistical power 
are poor and the variability of results (in terms of the standard 
deviations of the true/false discoveries) are larger. One explana- 
tion of this phenomenon is that all normalization procedures 
reduce the variability caused by the afj term, which quantifies the 
array-specific variation, from Model (12), thus increase the signal- 
to-noise ratio of the normalized expressions. It is known [27] that 
large a? term induces high intergene correlation. Consequently, 
when the intergene correlation of non-normalized expressions is 
low, NONE has better statistical power. As a comparison, we 
reduced p = 0.9 to p = 0.75 in SIMU1 and applied the same gene 
selection strategies as before. The true/false positives produced by 
gene selection strategies based on normalized data are almost 
identical to those produced from the original SIMU1 data; but 
NONE has slightly better power. For example, with true effect size 
1.0 and 60 up regulated genes, the mean number of true positives 
detected by ?-test and Bonferroni procedure increases from 84.15 
to 86.95. 



(a) TALL v. TEL 



(b) HYPERDIP v. TEL 
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Figure 6. Number of detected DEGs as a function of sample size, (a): TALL versus TEL; (b) HYPERDIP versus TEL. Total number of genes is 
m = 9005. f-test and Bonferroni procedure are applied. Adjusted p-value threshold: 0.05. 
doi:1 0.1 371 /journal.pone.0099380.g006 
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Among all normalization procedures, we found that DELTA is 
the most robust one. The numbers of false positives produced by 
the <5-sequence are very consistent for all sample sizes and effect 
sizes. Its testing power never reaches 100%, however. As a 
comparison, gene selection strategies without normalization have 
better performance in terms of testing power and Type I error 
control compared with their counterparts with normalization 
when the sample size is large and/ or the effect size is large. 

Compared with other normalization procedures, SVA produces 
very unstable results. Take Figure 4 (SIMU-BIO) as an example, 
the testing power related with SVA is very good (especially for the 
unbalanced data), but it produces far too many false positives in 
several occasions. Similar unstable behavior of SVA can be 
observed in Figures 1, 2, and 3. 

Another notable result is that gene selection procedures based 
on normalizations have better power while retaining less or 
comparable false positives when the gene differential expression 
structure is balanced. 

We also compare the performance of different normalization 
procedures by ROC curves. ROC curves represent the balance of 
true and false positives detected at different significance level and is 
most relevant if the primary goal of expression analysis is to select 
a fixed number of "top genes" instead of using a formal MTP to 
control type I error. In Figure 5, QUANT and RANK have the 
best performance when the differential expression structure is 
balanced; GLOBAL is the clear winner otherwise. NONE is more 
attractive when the differential expression structure is unbalanced. 
DELTA has poor performance compared with other options. 
GLOBAL seems to have the best overall performance for both 
cases. 

Based on these results, our main conclusions can be summarized 
as follows. 

1 . We recommend applying normalization when the sample size 
is relatively small (n< 10 per-group). Failing to do so may lead 
to dismal statistical power and high variability of results. 
Cautions still need to be used in this case because some 
normalization procedures are more susceptible to large 
phenotypic changes and/ or non-additive noise. If unsure, we 
recommend quantile or rank normalization because they are 
more robust to these factors. 

2. We only recommend global normalization if either a) the 
statistical power is too low (too few DEGs are identified) and 
the differential expression structure (in terms of up/down 
regulated genes) is balanced; b) the goal is to select a fixed 
number of "top genes". 

3. For large sample data, we recommend conducting differential 
expression analysis without normalization first. If the statistical 
power is adequate and a reasonably high percentage (>5%) of 
genes are selected as DEGs, we recommend no further 
normalization. This is because if the power is adequate, the 
advantage of variance-reduction provided by normalization 
procedures could be out-weighted by the bias and thus the 
inflation of false positives induced by these procedures. If the 
statistical power is too low, a robust normalization such as rank 
or <5-sequence normalization is recommended. 

4. The ^-sequence method can serve as a robust normalization 
candidate when either sample size is large or dramatic 
phenotypic changes are expected. 

We think similar analysis can be applied to models which 
characterize systematic noise sources in other ways. Though we 
choose to focus on the Affymetrix GeneChip platform throughout 
this paper, our conclusions should be valid for other array 



platforms which require/recommend normalization, such as 
Affymetrix exon-arrays [41,42], Illumina BeadChip arrays [43- 
45], Illumina transcriptome sequencing (mRNA-Seq) data [46], 
Illumina Infinium whole genome genotyping (WGG) arrays [47], 
Solexa/Illumina deep sequencing technology [48], and many 
others. 

In a sense, between-array variation can be considered as a 
special form of batch-effect, in which one "batch" consists of just a 
single array. In this way, several pre-processing procedures that 
are designed to remove batch effects, such as RUV-2 [49], 
ComBat [50], can be used in place of normalization procedures. 
Further investigations are required to fully understand the utility of 
these procedures when used for normalization, especially in large- 
scale studies. 

We focus on post-summarizing normalization procedures in this 
study because it is easier to derive the asymptotic bias and variance 
formula based on a random effect model. In practise, many 
normalization procedures are applied before the summarization 
step, which makes theoretical derivation of the bias-variance trade- 
off difficult. Nevertheless, we think the same principle can be 
applied to those normalization procedures and further investiga- 
tions are warranted. We hope this study can help biological 
researchers choose an appropriate gene selection procedure. 
Understanding both advantages and disadvantages of different 
gene selection strategies may also help the development of new 
normalization procedures, hypothesis tests and MTPs. 

Supporting Information 

File SI Supporting tables and figures. Table SI. The 

impact of different effect sizes e on gene selection strategies when 
the sample size n is fixed and relatively small. Mean (STD) of true 
positives computed from SIMU1 with 20 repetitions are 
reported. Sample size: «=10. Total number of genes: 1000. 
Number of differentially expressed genes: 100. Number of 
permutations for Nstat: 10000. The significance threshold: 0.05. 
Table S2. The impact of different effect sizes e on gene selection 
strategies when the sample size n is fixed and relatively small. 
Mean (STD) of false positives computed from SIMU1 with 20 
repetitions are reported. Sample size: w=10. Total number of 
genes: 1000. Number of differentially expressed genes: 100. 
Number of permutations for Nstat: 10000. The significance 
threshold: 0.05. Table S3. The impact of different sample sizes n 
on gene selection strategies when the effect size e is fixed and 
relatively small. Mean (STD) of true positives computed from 
SIMU2 with 20 repetitions are reported. Effect size: e = 0.2. Total 
number of genes: 1000. Number of differentially expressed genes: 
100. Number of permutations for Nstat: 10000. The significance 
threshold: 0.05. Table S4. The impact of different sample sizes n 
on gene selection strategies when the effect size e is fixed and 
relatively small. Mean (STD) of false positives computed from 
SIMU2 with 20 repetitions are reported. Effect size: e = 0.2. Total 
number of genes: 1000. Number of differentially expressed genes: 
100. Number of permutations for Nstat: 10000. The significance 
threshold: 0.05. Table S5. The impact of different sample sizes n 
on gene selection strategies when the effect size e is fixed and 
relatively large. Mean (STD) of true positives computed from 
SIMU2 with 20 repetitions are reported. Effect size: e= 1.8. Total 
number of genes: 1000. Number of differentially expressed genes: 
100. Number of permutations for Nstat: 10000. The significance 
threshold: 0.05. Table S6. The impact of different sample sizes n 
on gene selection strategies when the effect size e is fixed and 
relatively large. Mean (STD) of false positives computed from 
SIMU2 with 20 repetitions are reported. Effect size: e= 1.8. Total 



PLOS ONE | www.plosone.org 



11 



June 2014 | Volume 9 | Issue 6 | e99380 



Bias-Variance Trade-Off in Normalization Procs 



number of genes: 1000. Number of differentially expressed genes: 
100. Number of permutations for Nstat: 10000. The significance 
threshold: 0.05. Table S7. The impact of different sample sizes n 
on gene selection strategies with simulation based on biological 
data. Mean (STD) of true positives computed from SIMU-BIO 
with 20 repetitions are reported. Total number of genes: 9005. 
Number of permutations for Nstat: 100000. The significance 
threshold: 0.05. Table S8. The impact of different sample sizes n 
on gene selection strategies with simulation based on biological 
data. Mean (STD) of false positives computed from SIMU- 
BIO with 20 repetitions are reported. Total number of genes: 
9005. Number of permutations for Nstat: 100000. The signifi- 
cance threshold: 0.05. Table S9. The numbers of differentially 
expressed genes detected by different selection strategies. Total 
number of genes: 9005. Number of permutations for Nstat: 
100000. The significance threshold: 0.05. Figure SI. Histogram 
of pairwise Pearson correlation coefficients between genes 
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